Scalable Probabilistic Entity-Topic Modeling
Authors
Abstract
We present an LDA-based approach to entity disambiguation. Each topic is associated with a Wikipedia article, and topics generate either content words or entity mentions. Training such models is challenging because both the number of topics and the vocabulary size are in the millions. We tackle these problems using a novel distributed inference and representation framework based on a parallel Gibbs sampler guided by the Wikipedia link graph, together with MapReduce pipelines that allow fast and memory-frugal processing of large datasets. We report state-of-the-art performance on a public dataset.
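For concreteness, the sketch below shows one collapsed Gibbs sweep for an entity-topic model of this kind, where each mention is only resampled over a small candidate set of entities (standing in for the link-graph guidance the abstract describes). The corpus, sizes, and candidate-generation step are illustrative assumptions, not the paper's implementation.

import numpy as np

# Minimal sketch of a collapsed Gibbs sweep for an entity-topic model,
# assuming each topic corresponds to one Wikipedia entity and each token
# only considers a small candidate set (hypothetical stand-in for the
# link-graph guidance). Toy sizes; the paper works with millions.
rng = np.random.default_rng(0)
V, K = 50, 10
alpha, beta = 0.1, 0.01

# toy corpus: documents of (token_id, candidate_entity_ids) pairs
docs = [[(int(rng.integers(V)), rng.choice(K, size=3, replace=False))
         for _ in range(20)] for _ in range(5)]

# count tables maintained by the collapsed sampler
n_dk = np.zeros((len(docs), K))      # topic counts per document
n_kw = np.zeros((K, V))              # word counts per topic
n_k = np.zeros(K)                    # total words per topic
z = [[0] * len(d) for d in docs]     # current assignment per token

# random initialisation restricted to each token's candidate set
for d, doc in enumerate(docs):
    for i, (w, cands) in enumerate(doc):
        k = int(rng.choice(cands))
        z[d][i] = k
        n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

def gibbs_sweep():
    for d, doc in enumerate(docs):
        for i, (w, cands) in enumerate(doc):
            k_old = z[d][i]
            n_dk[d, k_old] -= 1; n_kw[k_old, w] -= 1; n_k[k_old] -= 1
            # conditional over the candidate set only: restricting the
            # support is what keeps sampling tractable at large K
            p = (n_dk[d, cands] + alpha) * (n_kw[cands, w] + beta) \
                / (n_k[cands] + V * beta)
            k_new = int(rng.choice(cands, p=p / p.sum()))
            z[d][i] = k_new
            n_dk[d, k_new] += 1; n_kw[k_new, w] += 1; n_k[k_new] += 1

for _ in range(50):
    gibbs_sweep()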
Similar resources
A Scalable Gibbs Sampler for Probabilistic Entity Linking
Entity linking involves labeling phrases in text with their referent entities, such as Wikipedia or Freebase entries. This task is challenging due to the large number of possible entities, in the millions, and heavy-tailed mention ambiguity. We formulate the problem in terms of probabilistic inference within a topic model, where each topic is associated with a Wikipedia article. To deal with th...
Anchors Regularized: Adding Robustness and Extensibility to Scalable Topic-Modeling Algorithms
Spectral methods offer scalable alternatives to Markov chain Monte Carlo and expectation maximization. However, these new methods lack the rich priors associated with probabilistic models. We examine Arora et al.’s anchor words algorithm for topic modeling and develop new, regularized algorithms that not only mathematically resemble Gaussian and Dirichlet priors but also improve the interpretab...
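The sketch below illustrates the recovery step of an anchor-words algorithm of the kind Arora et al. describe, assuming the anchor word indices are already chosen; plain non-negative least squares stands in for the exponentiated-gradient solver typically used, so treat it as a simplification rather than the authors' method.

import numpy as np
from scipy.optimize import nnls

def recover_topics(Q, anchors):
    """Q: V x V joint word co-occurrence matrix; anchors: K anchor word ids.
    Returns A with A[w, k] approximating p(word = w | topic = k)."""
    p_w = Q.sum(axis=1)                           # marginal word probabilities
    Q_bar = Q / np.maximum(p_w[:, None], 1e-12)   # row-normalised co-occurrence
    V, K = Q.shape[0], len(anchors)
    B = Q_bar[anchors].T                          # columns are anchor rows
    C = np.zeros((V, K))
    for w in range(V):
        c, _ = nnls(B, Q_bar[w])                  # non-negative mixing weights
        C[w] = c / max(c.sum(), 1e-12)            # renormalise onto the simplex
    A = C * p_w[:, None]                          # Bayes: p(k|w) p(w) = p(k, w)
    return A / np.maximum(A.sum(axis=0, keepdims=True), 1e-12)

# toy usage with a random symmetric co-occurrence and two assumed anchors
rng = np.random.default_rng(0)
M = rng.random((6, 6)); Q = M + M.T; Q /= Q.sum()
print(recover_topics(Q, anchors=[0, 3]).round(3))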
Grounding Topic Models with Knowledge Bases
Topic models represent latent topics as probability distributions over words which can be hard to interpret due to the lack of grounded semantics. In this paper, we propose a structured topic representation based on an entity taxonomy from a knowledge base. A probabilistic model is developed to infer both hidden topics and entities from text corpora. Each topic is equipped with a random walk ov...
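One plausible reading of "equipping each topic with a random walk over the taxonomy" is a personalised random walk that propagates a topic's entity weights along taxonomy edges; the sketch below implements that interpretation. The graph, restart probability, and seed weights are illustrative assumptions, not the paper's model.

import numpy as np

def topic_entity_distribution(adj, seed_probs, restart=0.15, iters=50):
    """Random walk with restart: spread a topic's seed entity weights
    through the taxonomy graph given by adjacency matrix `adj`."""
    # column-stochastic transition matrix over taxonomy edges
    T = adj / np.maximum(adj.sum(axis=0, keepdims=True), 1e-12)
    p = seed_probs.copy()
    for _ in range(iters):
        p = restart * seed_probs + (1 - restart) * T @ p
    return p

# toy taxonomy over 4 entities (undirected edges for simplicity)
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 0],
                [0, 1, 0, 0]], float)
seed = np.array([0.7, 0.3, 0.0, 0.0])  # topic concentrated on two entities
print(topic_entity_distribution(adj, seed).round(3))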
Mixed Membership Word Embeddings: Corpus-Specific Embeddings Without Big Data
Word embeddings provide a nuanced representation of words which can improve the performance of NLP systems by revealing the hidden structural properties of words and their relationships to each other. These models have recently risen in popularity due to the successful performance of scalable algorithms trained in the big data setting. Consequently, word embeddings are commonly trained on very ...
Mixed Membership Word Embeddings for Computational Social Science
Word embeddings improve the performance of NLP systems by revealing the hidden structural relationships between words. These models have recently risen in popularity due to the performance of scalable algorithms trained in the big data setting. Despite their success, word embeddings have seen very little use in computational social science NLP tasks, presumably due to their reliance on big data...
Journal: CoRR
Volume: abs/1309.0337
Issue: -
Pages: -
Year: 2013